System and method for conditional multi-output regression for machine condition monitoring

ABSTRACT

A method for predicting sensor output values of a sensor monitoring system includes providing a set of test input values to a system of sensors, and one or more known sensor output values from the sensor system, where other sensor output values are unknown; calculating, for each unknown sensor output value, a predictive Gaussian distribution function from the test input values and the known sensor output values; and predicting each unknown output $y_m$ by integrating over a product of the predictive Gaussian distribution function and a conditional Gaussian distribution of the unknown sensor output values with respect to the test input values and the other unknown sensor output values. A mean and covariance of the predictive Gaussian distribution function are determined by a training phase, and a hyperparameter of the conditional Gaussian distribution is determined by another training phase.

CROSS REFERENCE TO RELATED UNITED STATES APPLICATIONS

This application claims priority from "Conditional Multi-Output Regression For Machine Condition Monitoring", U.S. Provisional Application No. 61/388,127 of Chao Yuan, filed Sep. 30, 2010, the contents of which are herein incorporated by reference in their entirety.

TECHNICAL FIELD

This disclosure is directed to methods for monitoring the condition of a machine based on sensor output.

DISCUSSION OF THE RELATED ART

Monitoring the condition of an expensive device or machine, such as a power plant or an airplane, has received increasing interest in recent years. The goal is to detect failures of these machines at an early stage to avoid subsequent catastrophic damage. This may be achieved by analyzing the values from a set of sensors installed in different parts of a machine. When correlations among the sensor values are violated, there is likely to be a failure. One important aspect of modeling such correlations is the ability to accurately predict process outputs based on process inputs. This naturally forms a multi-output regression that aims to learn a mapping from the input space to an M-dimensional output space.

Multi-output regression aims at learning a mapping from the input space to an M-dimensional output space. Given the fact that outputs are usually dependent on each other, previous studies have focused on modeling the correlation or the joint predictive distribution of the outputs. The learned joint predictive distribution can then be applied to a wide variety of problems.

The following conditional multi-output regression is of interest. For a test input, if some outputs are additionally known, how can this extra information be used to improve the prediction of the remaining outputs? For example, at a geographic location, given the concentration of a metal that is less expensive to measure, can one estimate the concentration of another metal? In financial markets, does the earnings release of company A help to better forecast the earnings release of company B? In many modern monitoring systems, sensor values are transferred in real time from the machine to the diagnostic center, but they often arrive sequentially instead of simultaneously due to network issues.

Can these available sensor values be used to help predict other missing sensor values? If the inputs are a Markov blanket of the outputs, for example in FIG. 1(a), where there are two outputs y₁ and y₂ conditioned on inputs x, there is no advantage in doing so. This is because, given x, y₁ and y₂ are conditionally independent, such that P(y₁|x, y₂) = P(y₁|x). Thus, using the other output y₂ as an input does not help. However, some inputs may be hidden, as shown in FIG. 1(b), where input z is never observed. This is a more realistic scenario, because it is challenging to measure all inputs in real datasets. In such cases, y₂ carries information about the missing z and is expected to improve the prediction of y₁ if used as an input.

Previous approaches usually tackle this task by inferring the unknown outputs conditionally from the known outputs based on their joint predictive distribution. However, learning the joint predictive distribution is quite challenging, especially when the regression mapping is nonlinear. Because multi-output regression can be viewed as a special case of multi-task learning, where the regression tasks share the same inputs, many multi-task techniques are also applicable to multi-output regression tasks. However, most of these techniques focus on sharing representation and parameters among M single-output tasks. In prediction, all single-output models work independently without considering their correlations.

SUMMARY OF THE INVENTION

Exemplary embodiments of the invention as described herein generally include methods and systems for conditional multi-output regression. A method according to an embodiment of the invention includes two models. In a conditional model according to an embodiment of the invention, given M outputs, each output is dependent on the inputs and all other M−1 outputs. By doing this, the other outputs may be viewed the same as the inputs, and thus the original multi-output task may be broken into simpler single-output tasks. If all of the other M−1 outputs are known, this conditional model alone gives a prediction of the target output. Otherwise, a generative model according to an embodiment of the invention may be used to infer the unknown outputs based on the inputs, and their uncertainties are then propagated to the conditional model to make a final prediction. Note that the terms "conditional" and "generative" are with respect to outputs, not inputs.

A framework according to an embodiment of the invention is very general. Various existing regression techniques can be used in the conditional model according to other embodiments of the invention. An even broader range of algorithms, as long as they provide error bars for their predictions, can be used for the generative model according to other embodiments of the invention.

According to an aspect of the invention, there is provided a method for predicting sensor output values of a sensor monitoring system, including providing a set of one or more test input values to a system of sensors, and one or more known sensor output values from the sensor system, where other sensor output values are unknown, calculating, for each unknown sensor output value, a predictive Gaussian distribution function

$\begin{matrix}{{P( y_{U} \middle| x )} = {\prod\limits_{m \in U}^{\;}\; {P( y_{m} \middle| x )}}} \\{= {\frac{1}{( {2\; \pi} )^{d/2}{S_{y_{U}}}^{1/2}}{\exp ( {{- \frac{1}{2}}( {y_{U} - \mu_{y_{U}}} )^{T}{S_{y_{U}}^{- 1}( {y_{U} - \mu_{y_{U}}} )}} )}}}\end{matrix}$

from the test input values, where vector x of dimensionality d represents the test input values, vector $y_U$ of dimensionality M represents the set of unknown output sensor values, $y_m \in y_U$, $\mu_{y_U}$ is a vector of mean values of the set $y_U$ determined by a training phase, and $S_{y_U}$ is a diagonal covariance matrix of the set $y_U$ determined by the training phase; and predicting each unknown output $y_m$ from $P(y_m \mid x, y_O) = \int_{y_{U\bar m}} P(y_m \mid x, y_{U\bar m})\, P(y_U \mid x)\, dy_{U\bar m}$, where vector $y_O$ represents the known sensor output values, vector $y_{U\bar m}$ represents the unknown output sensor values in $y_U$ except $y_m$, and $P(y_m \mid x, y_{U\bar m})$ is a conditional Gaussian distribution defined by $\log P(y_m \mid x, y_{U\bar m}) = -\tfrac12 \log|K| - \tfrac12\, y_{U\bar m} K^{-1} y_{U\bar m}^T + C$, where $C = -(d/2)\log(2\pi)$ and K is an N×N kernel matrix defined between pairs of test input values $x_i$, $x_j$, where N is the number of test input values, whose elements $K_{i,j}$ are defined by Gaussian kernel functions $K_{i,j} = k(x_i, x_j \mid \Lambda) = \exp(-\tfrac12 (x_i - x_j)^T \Lambda^{-1}(x_i - x_j))$, where $\Lambda = \mathrm{diag}[\lambda_1^2, \ldots, \lambda_d^2]$ whose values $\lambda_i$ are determined by another training phase.

According to a further aspect of the invention, the predictive Gaussian distribution function $P(y_m \mid x)$ for each unknown output $y_m$ is trained by maximizing $\log P(Y_m \mid X_m, \theta) = -\tfrac12 \log|K| - \tfrac12\, Y_m K^{-1} Y_m^T + C$ with respect to a hyperparameter $\theta = \{\lambda_1, \ldots, \lambda_d\}$, where $Y_m$ is a set of $N_m$ training samples for sensor output values corresponding to a set $X_m$ of $N_m$ training samples for sensor input values.

According to a further aspect of the invention, $\log P(Y_m \mid X_m, \theta) = -\tfrac12 \log|K| - \tfrac12\, Y_m K^{-1} Y_m^T + C$ is maximized using a conjugate gradient method.

According to a further aspect of the invention, the conditional Gaussian distribution $P(y_m \mid x, y_{U\bar m}, y_O)$ is trained by maximizing $\log P(Y_m \mid X, Y_{\bar m}, \theta) = -\tfrac12 \log|K| - \tfrac12\, Y_m K^{-1} Y_m^T + C$ with respect to a hyperparameter $\theta = \{\lambda_1, \ldots, \lambda_{d+M-1}\}$, for the training set of input values X and output values Y, where $Y_{\bar m}$ is an (M−1)×N matrix that represents the training sets for all outputs except $y_m$, and where the Gaussian kernel functions are $K_{i,j} = k(z_i, z_j \mid \Lambda) = \exp(-\tfrac12 (z_i - z_j)^T \Lambda^{-1}(z_i - z_j))$, where z is a (d+M−1)-dimensional input vector that combines vectors $x \in X$ and $y_{\bar m} \in Y_{\bar m}$.

According to a further aspect of the invention, the method includes, when there are $N - N_m$ missing values in the set of output values $Y_m$, sampling a plurality of output values from the predictive Gaussian distribution $P(y_m \mid x)$ for each missing value in $Y_m$, and maximizing an average over the plurality of sampled values

$\frac{1}{T}{\sum\limits_{t = 1}^{T}{\log \; {P( { Y_{m}^{(t)} \middle| X_{m} ,Y_{\overset{\_}{m}}^{(t)},\theta} )}}}$

with respect to the hyperparameter θ, where T is the number of the plurality of sampled values for each missing value in $Y_m$, and where each missing value in $Y_m$ is replaced by a sampled value.

According to a further aspect of the invention, the method includes repeatedly calculating

${Q^{({l + 1})}( y_{m} )} = {\int_{y_{U\overset{\_}{m}}}{{P( { y_{m} \middle| x ,y_{\overset{\_}{m}}} )}{\prod\limits_{i \in U_{\overset{\_}{m}}}^{\;}\; {{Q^{(l)}( y_{i} )}{y_{U\overset{\_}{m}}}}}}}$

until convergence, where each $Q^{(1)}(y_m)$ is initialized as $Q^{(1)}(y_m) = P(y_m \mid x, y_O) = \int_{y_{U\bar m}} P(y_m \mid x, y_{U\bar m})\, P(y_U \mid x)\, dy_{U\bar m}$.

According to another aspect of the invention, there is provided a program storage device readable by a computer, tangibly embodying a program of instructions executable by the computer to perform the method steps for predicting sensor output values of a sensor monitoring system.

BRIEF DESCRIPTION OF THE DRAWINGS

FIGS. 1(a)-(b) illustrate when outputs can be used as extra inputs, according to an embodiment of the invention.

FIGS. 2(a)-(b) graphically represent a modeling framework according to an embodiment of the invention.

FIG. 3 is a flowchart of an iterative method for conditional multi-output regression, according to an embodiment of the invention.

FIG. 4 is a table of the primary and secondary variables for the three tests on the Jura dataset, according to an embodiment of the invention.

FIG. 5 is a table of the MAE results for different methods for the Jura data test under the inductive setting, according to an embodiment of the invention.

FIG. 6 is a table of the MAE results for different methods for the Jura data test under the transductive setting, according to an embodiment of the invention.

FIG. 7 shows the standardized MSE results for different methods in the Puma560 test, according to an embodiment of the invention.

FIGS. 8(a)-(b) illustrate the truck data test, according to an embodiment of the invention.

FIG. 9 is a block diagram of an exemplary computer system for implementing a method for conditional multi-output regression, according to an embodiment of the invention.

DETAILED DESCRIPTION OF EXEMPLARY EMBODIMENTS

Exemplary embodiments of the invention as described herein generally include systems and methods for conditional multi-output regression. Accordingly, while the invention is susceptible to various modifications and alternative forms, specific embodiments thereof are shown by way of example in the drawings and will herein be described in detail. It should be understood, however, that there is no intent to limit the invention to the particular forms disclosed; on the contrary, the invention is to cover all modifications, equivalents, and alternatives falling within the spirit and scope of the invention.

Models:

A standard multi-output regression learns the mapping P(y|x) from inputs $x \in R^d$ to outputs $y \in R^M$. In a conditional multi-output regression according to an embodiment of the invention, one is additionally given some observed outputs $y_O$. According to an embodiment of the invention, it is desired to predict $y_m$, each of the remaining unknown outputs $y_U$:

$P(y_m \mid x, y_O).$  (1)

Both O and U are index sets, forming a partition of the full index set $\{i\}_{i=1}^{M}$. Equivalently, one has $y = \{y_U, y_O\}$. The notation $U_{\bar m} = \{i \mid i \in U, i \neq m\}$ is used herein below to denote the index set U excluding m; $y_{U\bar m}$ thus represents all outputs in $y_U$ except $y_m$. Similarly, $\bar m = \{i \mid 1 \le i \le M, i \neq m\}$ indicates the full index set excluding m, such that $y_{\bar m} = \{y_{U\bar m}, y_O\}$. Note that for a regression (vs. a density estimation), there is no need to calculate the predictive joint distribution $P(y_U \mid x, y_O)$ (or to consider the predictive covariance among the outputs $y_U$), because the goal of regression is to give an estimate of each $y_m$ close to its ground truth value.

In prior single generative models, the joint predictive distribution P(y|x) is learned, which solves a more general task. Once P(y|x) is known, for example as a multivariate Gaussian distribution, the full joint predictive distribution $P(y_U \mid x, y_O)$ can be derived directly. However, learning P(y|x) is quite challenging. Without a sparse approximation, popular prior methods require a formidable complexity of O(M³N³), and the learned distribution is often inaccurate.

It is tempting to ask the following question: can one solve this regression task using only standard regression algorithms? A straightforward way is to learn a regression model for every $P(y_m \mid x, y_O)$ during training. However, because one does not know beforehand which outputs will be unknown or available during testing, all possible scenarios need to be considered during training. This requires a total of $M \times 2^{M-1}$ predictors, which soon becomes intractable as M increases.

According to an embodiment of the invention, a two-model framework transforms the original multi-output regression task into 2M single-output regression tasks. Specifically, to predict each unknown output $y_m$, where $m \in U$, a Bayesian integration is performed over the other unknown outputs $y_{U\bar m}$:

$P(y_m \mid x, y_O) = \int_{y_{U\bar m}} P(y_m \mid x, y_{\bar m})\, P(y_{U\bar m} \mid x, y_O)\, dy_{U\bar m}.$  (2)

In a conditional model according to an embodiment of the invention, one directly models $P(y_m \mid x, y_{\bar m})$, the first factor in the integrand of EQ. (2). A total of M conditional probabilities are specified, with each output $y_m$ conditioned on the inputs x and the remaining outputs $y_{\bar m}$.

In a generative model according to an embodiment of the invention, $P(y_{U\bar m} \mid x, y_O)$, the second factor in the integrand of EQ. (2), is modeled. The generative model is used to infer the other unknown outputs. The uncertainty of those outputs is then propagated to the conditional model through the Bayesian integration to make the final prediction.

Previous generative approaches that give a joint predictive distribution are straightforward candidates for a generative model according to an embodiment of the invention. However, due to the limitations noted above, an independent generative model is considered, in which, given the inputs, all outputs are independent. The second factor in the integrand of EQ. (2) now becomes:

$\begin{matrix}{{P( { y_{U\overset{\_}{m}} \middle| x ,y_{O}} )} = {\prod\limits_{i \in {U\overset{\_}{m}}}^{\;}\; {{P( y_{i} \middle| x )}.}}} & (3)\end{matrix}$

$y_O$ is omitted on the right side of EQ. (3) because it is independent of $y_{U\bar m}$ given the inputs x. A generative model according to an embodiment of the invention therefore comprises M regression predictors, each predicting one output from the inputs. This is computationally more efficient than a full generative model. As will be shown, even with such a simple model, a method according to an embodiment of the invention can achieve significant performance improvements. In addition, an iterative algorithm according to an embodiment of the invention uses this simple generative model as an initial approximation to $P(y_{U\bar m} \mid x, y_O)$, and refines this approximation through iterations.

FIGS. 2(a)-(b) graphically represent a framework according to an embodiment of the invention, where the arrows indicate dependency. In a conditional model according to an embodiment of the invention, illustrated in FIG. 2(a), each output is modeled as dependent on both the inputs and the remaining M−1 outputs. In a generative model according to an embodiment of the invention, illustrated in FIG. 2(b), each output is modeled as dependent only on the inputs.

According to an embodiment of the invention, Gaussian processes are used as the basic predictors for each of the 2M regression tasks. Given training data X and Y, containing N training samples for the inputs x and the output y, respectively, to learn the mapping from x to y, a Gaussian process assumes that Y has a Gaussian distribution of

$\begin{matrix}\begin{matrix}{{P( Y \middle| X )} = {N( { Y \middle| 0 ,{K + {\sigma^{2}I}}} )}} \\{{= {\frac{1}{( {2\; \pi} )^{d/2}{{K + {\sigma^{2}I}}}^{1/2}}{\exp ( {{- \frac{1}{2}}{Y^{T}( {K + {\sigma^{2}I}} )}^{- 1}Y} )}}},}\end{matrix} & (4)\end{matrix}$

where Y is an N-dimensional vector of output training samples, X comprises the d-dimensional input training samples $x_i$, $i = 1, \ldots, N$, I is an identity matrix, and σ² is the prediction noise variance. The kernel matrix K considered here is an N×N matrix comprising Gaussian kernel functions between pairs of inputs $x_i$ and $x_j$:

$k(x_i, x_j \mid \Lambda) = \exp\left(-\tfrac12 (x_i - x_j)^T \Lambda^{-1}(x_i - x_j)\right),$  (5)

where $\Lambda = \mathrm{diag}[\lambda_1^2, \ldots, \lambda_d^2]$ contains squared length scales, allowing automatic relevance determination of each input dimension for predicting the output. The hyperparameter θ of the Gaussian process is thus $\{\lambda_1, \ldots, \lambda_d, \sigma\}$.

In general, the Gaussian form for P(y|x) may be written as

$\begin{matrix}\begin{matrix}{{P( y \middle| x )} = {N( { y \middle| {\sum\limits_{n = 1}^{N}{\beta_{n}{k( {x,x_{i}} )}}} ,{\gamma^{2}I}} )}} \\{= {\frac{1}{( {2\; \pi} )^{1/2}{\gamma^{2}}^{1/2}}{\exp ( {{- \frac{1}{2\; \gamma^{2}}}( {y - {\sum\limits_{n = 1}^{N}{\beta_{n}{k( {x,x_{i}} )}}}} )^{2}} )}}}\end{matrix} & (6)\end{matrix}$

where $\beta_n$ and $\gamma^2$ are constant parameters determined by the training process.
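To make the preceding definitions concrete, the following is a minimal numpy sketch (illustrative only, not code from the patent) of the ARD Gaussian kernel of EQ. (5) and a standard GP predictive Gaussian in the spirit of EQ. (6); the function names, the unit signal variance, and the direct matrix solves are assumptions made for brevity.

```python
import numpy as np

def ard_kernel(X1, X2, lengthscales):
    """Gaussian (squared-exponential) kernel of EQ. (5) with ARD
    length scales, i.e. Lambda = diag(lengthscales**2)."""
    # Scale each dimension by its length scale, then compute all
    # pairwise squared Euclidean distances.
    A = X1 / lengthscales                      # (N1, d)
    B = X2 / lengthscales                      # (N2, d)
    sq = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2 * A @ B.T
    return np.exp(-0.5 * np.maximum(sq, 0.0))  # guard tiny negatives

def gp_predict(X, Y, x_star, lengthscales, noise_var):
    """Predictive mean and variance of a GP at test inputs x_star,
    following the standard GP regression equations behind EQ. (6)."""
    K = ard_kernel(X, X, lengthscales) + noise_var * np.eye(len(X))
    k_star = ard_kernel(x_star, X, lengthscales)   # (N*, N)
    alpha = np.linalg.solve(K, Y)                  # the beta_n weights
    mean = k_star @ alpha
    v = np.linalg.solve(K, k_star.T)
    # prior variance k(x*, x*) = 1 for this kernel, plus noise
    var = 1.0 + noise_var - np.sum(k_star * v.T, axis=1)
    return mean, var
```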

Once both a conditional model and a generative model according to embodiments of the invention have been learned, EQ. (2) can be evaluated. The choice of Gaussian kernel functions is useful for allowing EQ. (2) to be analytically evaluated. A variety of other probabilistic regression algorithms can be used in other embodiments of the invention, as long as they use Gaussian kernel functions that make the Bayesian integration tractable. An exemplary, non-limiting list of such techniques includes the relevance vector machine, Bayesian support vector regression, and Gaussian processes.

Training:

There are a total of 2M Gaussian processes to be learned, with M processes for a generative model according to an embodiment of the invention and the other M processes for a conditional model according to an embodiment of the invention. Training the models breaks down to training each of the 2M Gaussian process predictors independently. Suppose that there are N training samples for the inputs, denoted by $X = \{x_n\}_{n=1}^{N}$, and $N_m$ training samples for the outputs $y_m$, denoted by $Y_m = \{y_{m,n}\}_{n=1}^{N_m}$. Each $y_m$ corresponds to $N_m$ input training samples, denoted by $X_m = \{x_{m,n}\}_{n=1}^{N_m}$, where $X_m \subset X$.

Two cases may be considered. In the first case, the training data are complete: $X_m = X$ and $N_m = N$ for all outputs. In other words, for every input sample $x_n$ and every output $y_m$, there is a corresponding training sample $y_{m,n}$. Thus, all input and output training data form a (d+M)×N matrix. In the second case, the training data contain missing entries: some outputs $y_m$ may only correspond to their particular input training set $X_m$ instead of the common input training set X. In this case, the size of $Y_m$, namely $N_m$, is less than N; equivalently, there are $N - N_m$ missing values in $Y_m$.
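As an aside, the two cases admit a simple data layout; the sketch below (toy sizes and NaN masking are assumptions, not from the patent) shows a complete training set and the per-output reduced sets $(X_m, Y_m)$ of the missing-entry case.

```python
import numpy as np

# Hypothetical toy sizes: d = 2 inputs, M = 3 outputs, N = 5 samples.
d, M, N = 2, 3, 5
rng = np.random.default_rng(0)

X = rng.normal(size=(N, d))   # common input training set
Y = rng.normal(size=(N, M))   # complete case: every y_{m,n} observed

# Missing case: mark absent entries with NaN; each output m then has
# its own reduced training set (X_m, Y_m) with N_m <= N samples.
Y[3, 1] = np.nan
for m in range(M):
    observed = ~np.isnan(Y[:, m])
    X_m, Y_m = X[observed], Y[observed, m]   # N_m = observed.sum()
```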

A. Training the Generative Model

For a generative model according to an embodiment of the invention, learning the Gaussian process for predicting output $y_m$ is equivalent to learning the hyperparameter vector θ that maximizes the following log likelihood:

$\log P(Y_m \mid X_m, \theta) = -\tfrac12 \log|K| - \tfrac12\, Y_m K^{-1} Y_m^T + C,$  (7)

where the covariance matrix K contains θ and C is a constant term. According to an embodiment of the invention, $C = -(d/2)\log(2\pi)$, where d is the dimensionality of the input samples. According to an embodiment of the invention, a conjugate gradient method may be used to search for θ.
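For illustration, a hyperparameter search of this kind might look as follows. This sketch reuses ard_kernel from the earlier sketch, optimizes log-scaled hyperparameters to keep them positive, and uses the standard (N/2) log 2π constant of GP marginal likelihoods; these are implementation choices assumed here, not specified by the patent.

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_marginal_likelihood(log_theta, X, y):
    """Negative of EQ. (7): -log P(Y_m | X_m, theta). The last entry of
    log_theta is the log noise scale sigma; the rest are log length
    scales."""
    *log_ls, log_sigma = log_theta
    ls = np.exp(log_ls)
    K = ard_kernel(X, X, ls) + np.exp(2 * log_sigma) * np.eye(len(X))
    L = np.linalg.cholesky(K)                 # stable |K| and K^{-1}
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return (np.sum(np.log(np.diag(L)))        # (1/2) log|K|
            + 0.5 * y @ alpha                 # (1/2) y K^{-1} y^T
            + 0.5 * len(y) * np.log(2 * np.pi))

def fit_gp(X, y):
    theta0 = np.zeros(X.shape[1] + 1)         # unit scales initially
    res = minimize(neg_log_marginal_likelihood, theta0, args=(X, y),
                   method="CG")               # conjugate gradient search
    return np.exp(res.x)
```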

B. Training the Conditional Model

According to an embodiment of the invention, the first case assumes that the training data are complete ($X_m = X$). In the conditional model, each predictor has a new log likelihood to be maximized:

$\log P(Y_m \mid X, Y_{\bar m}, \theta),$  (8)

where $Y_{\bar m}$ is an (M−1)×N matrix representing the training sets for all outputs except $y_m$. EQ. (8) is very similar to EQ. (7), except that a new (d+M−1)-dimensional input vector z, combining x and $y_{\bar m}$, is used. There are thus d+M−1 length scales in the corresponding Λ and θ. Here, $P(y_m \mid x, y_{\bar m}, \theta)$ may be represented by the Gaussian of EQ. (6), with Gaussian kernel $k(z_i, z_j \mid \Lambda) = \exp(-\tfrac12 (z_i - z_j)^T \Lambda^{-1}(z_i - z_j))$, where $\Lambda = \mathrm{diag}[\lambda_1^2, \ldots, \lambda_{d+M-1}^2]$.

According to an embodiment of the invention, the second case considers missing data. In this case, for some output $y_m$, there are $N - N_m$ missing values in $Y_m$. According to an embodiment of the invention, these missing entries can be sampled using the already learned generative model. Specifically, for every missing entry in every $Y_m$, the corresponding predictive distribution from the generative model can be used to sample a value and put it at the missing position in $Y_m$. Such sampling is repeated T times. An exemplary, non-limiting value for T is 10. Now there are T sets of output training data, each denoted by $\{Y_m^{(t)}\}_{m=1}^{M}$, where $t = 1, 2, \ldots, T$. Every $Y_m^{(t)}$ now contains N values. $Y_{\bar m}^{(t)}$ denotes the t-th set of output training data excluding $Y_m^{(t)}$. The log likelihood is now maximized with an average evidence term from the T sampled regression problems:

$$\frac{1}{T}\sum_{t=1}^{T} \log P\!\left(Y_m^{(t)} \,\middle|\, X_m,\, Y_{\bar m}^{(t)},\, \theta\right). \qquad (9)$$
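A sketch of how this averaged objective might be assembled is given below; it reuses neg_log_marginal_likelihood from the earlier training sketch as the EQ. (8) likelihood on the augmented inputs, and the NaN masking and argument names are illustrative assumptions.

```python
import numpy as np

def averaged_likelihood_objective(X, Y, m, gen_means, gen_stds, T=10, seed=0):
    """Builds the EQ. (9) objective for output m: fill the missing
    entries of Y (marked NaN) with T draws from the generative model's
    predictive Gaussians, then average the T log likelihoods of EQ. (8).
    gen_means/gen_stds are (N, M) arrays from the trained generative model."""
    rng = np.random.default_rng(seed)
    miss = np.isnan(Y)
    samples = []
    for _ in range(T):
        Y_fill = Y.copy()
        Y_fill[miss] = rng.normal(gen_means[miss], gen_stds[miss])
        samples.append(Y_fill)
    def objective(log_theta):
        total = 0.0
        for Y_fill in samples:
            Z = np.hstack([X, np.delete(Y_fill, m, axis=1)])  # z = (x, y_{\bar m})
            # reuse the EQ. (7)-style likelihood on the augmented inputs
            total += -neg_log_marginal_likelihood(log_theta, Z, Y_fill[:, m])
        return -(total / T)   # negate again for use with a minimizer
    return objective
```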

A Gaussian process has a training complexity of O(N³). Training 2M Gaussian processes has a total cost of O(MN³), which is significantly smaller than the O(M³N³) needed by previous generative approaches, but can still be high for a large N. According to embodiments of the invention, various sparse Gaussian processes can achieve a smaller training complexity of O(κ²N) (κ≪N), and can replace the Gaussian processes used in a model according to an embodiment of the invention. An exemplary, non-limiting list of such techniques includes sparse online Gaussian processes, sparse greedy Gaussian process regression, and sparse Gaussian processes using pseudo-inputs.

Prediction:

A. A One-Pass Algorithm

In prediction, given the test inputs x and the additionally known outputs $y_O$, the task is to predict each unknown $y_m$ using EQ. (2). According to an embodiment of the invention, one first calculates the predictive distribution of all unknown outputs $y_U$ from the generative model:

$\begin{matrix}\begin{matrix}{{P( y_{U} \middle| x )} = {\prod\limits_{m \in U}{P( y_{m} \middle| x )}}} \\{= {N( { y_{U} \middle| \mu_{y_{U}} ,S_{y_{U}}} )}} \\{= {\frac{1}{( {2\pi} )^{d/2}{S_{y_{U}}}^{1/2}}\exp}} \\{( {{- \frac{1}{2}}( {y_{U} - \mu_{y_{U}}} )^{T}{S_{y_{U}}^{- 1}( {y_{U} - \mu_{y_{U}}} )}} )}\end{matrix} & (10)\end{matrix}$

This is a Gaussian distribution with mean $\mu_{y_U}$ and a diagonal covariance matrix $S_{y_U}$, which may be determined by the training process. EQ. (10) has a factorized form due to the independence assumption. Second, each unknown output $y_m$ is predicted in turn using EQ. (2). How to perform the Bayesian integration analytically is described in the next section.

Alternatively, EQ. (10) can be simplified such that the distribution of each $y_m$ shrinks to a point:

$P(y_m \mid x) = \delta(y_m - \mu_{y_m})$  (11)

The delta function δ(·) can be viewed as a probability density function with $\int_{y_m} \delta(y_m - \mu_{y_m})\, dy_m = 1$. It is defined as $\delta(y_m - \mu_{y_m}) = \infty$ if $y_m = \mu_{y_m}$ and $\delta(y_m - \mu_{y_m}) = 0$ otherwise. This is equivalent to assuming that all variances of the Gaussian distributions in EQ. (10) go to zero. In this case, EQ. (2) is simply $P(y_m \mid x, y_O) = P(y_m \mid x, \mu_{y_{U\bar m}}, y_O)$, which is a Gaussian with $y_{U\bar m}$ replaced by its mean $\mu_{y_{U\bar m}}$.
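The point-estimation variant is simple enough to sketch directly; in the following illustrative code, gen_models and cond_models are assumed wrapper objects (not from the patent) exposing a .predict method that returns 1-d arrays of predictive means and variances.

```python
import numpy as np

def one_pass_point(x, y_O, O, U, gen_models, cond_models):
    """One-pass prediction with point estimation (EQ. (11)): each unknown
    output is first fixed at its generative-model mean mu_{y_m}; each
    unknown y_m is then re-predicted by its conditional model given x,
    y_O, and the means of the other unknowns."""
    M = len(O) + len(U)
    y = np.empty(M)
    y[O] = y_O                                   # known outputs
    for m in U:                                  # generative pass: P(y_m | x)
        mu, _ = gen_models[m].predict(x[None, :])
        y[m] = mu[0]
    preds = {}
    for m in U:                                  # conditional pass: P(y_m | x, y_{\bar m})
        others = [i for i in range(M) if i != m]
        z = np.concatenate([x, y[others]])[None, :]
        mu, _ = cond_models[m].predict(z)
        preds[m] = mu[0]
    return preds
```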

B. An Iterative Algorithm

If $P(y_m \mid x, y_O)$, the result from the above one-pass algorithm, is viewed as a refinement of $P(y_m \mid x)$, the predictive distribution from the generative model, can one continue refining the result iteratively? To achieve this, according to an embodiment of the invention, it is assumed that $P(y_U \mid x, y_O)$ is unknown and can be factorized as

$\begin{matrix}{{P( { y_{U} \middle| x ,y_{o}} )} \eqsim {\prod\limits_{m \in U}{{Q( y_{m} )}.}}} & (12)\end{matrix}$

$Q(y_m)$ is an approximation to the ground-truth posterior distribution $P(y_m \mid x, y_O)$. EQ. (2) can now be written in the following iterative form:

$$Q^{(l+1)}(y_m) = \int_{y_{U\bar m}} P(y_m \mid x, y_{\bar m}) \prod_{i \in U_{\bar m}} Q^{(l)}(y_i)\, dy_{U\bar m}. \qquad (13)$$

$Q^{(l+1)}(y_m)$ denotes the result after l+1 iterations. In the (l+1)-th iteration, all posterior distributions $Q(y_m)$, where $m \in U$, are updated using the results from the l-th iteration.

An iterative algorithm according to an embodiment of the invention may be initialized by setting $Q^{(0)}(y_m) = P(y_m \mid x)$, the predictive distribution from the generative model. Thus, the first iteration is the same as that in a one-pass algorithm according to an embodiment of the invention. After this, the generative model is not needed. Once the algorithm converges, $Q(y_m)$ is output as the final prediction of $P(y_m \mid x, y_O)$.

FIG. 3 is a flowchart of an iterative method for conditional multi-output regression, according to an embodiment of the invention. Referring now to the figure, after providing, at step 30, a set of one or more test input values and a set of known sensor output values, an iterative algorithm according to an embodiment of the invention begins at step 31 by initializing each $Q^{(0)}(y_m) = P(y_m \mid x)$, the predictive distribution from the generative model. Then, at step 32,

${Q^{(l)}( y_{m} )} = {\int_{y_{U\; \overset{\_}{m}}}^{\;}{{P( { y_{m} \middle| x ,y_{\overset{\_}{m}}} )}{\prod\limits_{i \in U_{\overset{\_}{m}}}{{Q^{(0)}( y_{i} )}{y_{U\; \overset{\_}{m}}}}}}}$

is calculated using the conditional distribution $P(y_m \mid x, y_{\bar m})$. At step 33, $Q^{(l+1)}(y_m)$ and $Q^{(l)}(y_m)$ are compared for convergence, and if the difference is insufficiently small, step 32 is repeated for each $y_m$, using the previous result for Q in the product on the right-hand side. Once the $Q^{(l+1)}(y_m)$ have converged, $P(y_m \mid x, y_O) = Q(y_m)$ is output at step 34.

If a Dirac delta function is used for the generative model, the variance of the Gaussian result for $P(y_m \mid x, y_O) = P(y_m \mid x, \mu_{y_{U\bar m}}, y_O)$ is set to zero, and the resulting delta function is used for the $Q^{(l)}(y_i)$ in the product on the right-hand side of EQ. (13). This is repeated after every iteration, so that every Gaussian distribution on the left side of EQ. (13) becomes a delta function on the right side of EQ. (13) in the next iteration.
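A point-estimate rendering of this loop, under the same assumed .predict wrapper interface as the one-pass sketch above, might look as follows.

```python
import numpy as np

def iterative_point(x, y_O, O, U, gen_models, cond_models,
                    tol=1e-6, max_iter=50):
    """Point-estimate version of the FIG. 3 loop: each Q(y_m) is kept as
    a delta function at its current mean, so one sweep of EQ. (13)
    reduces to re-running every conditional predictor at the latest
    means of the other unknown outputs, until the means stop moving."""
    M = len(O) + len(U)
    y = np.empty(M)
    y[O] = y_O
    for m in U:                                  # step 31: Q^(0) from the generative model
        mu, _ = gen_models[m].predict(x[None, :])
        y[m] = mu[0]
    for _ in range(max_iter):
        y_prev = y.copy()
        for m in U:                              # step 32: one sweep of EQ. (13)
            others = [i for i in range(M) if i != m]
            z = np.concatenate([x, y_prev[others]])[None, :]
            mu, _ = cond_models[m].predict(z)
            y[m] = mu[0]
        if np.max(np.abs(y[U] - y_prev[U])) < tol:  # step 33: convergence test
            break
    return {m: y[m] for m in U}                  # step 34: final predictions
```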

This iterative algorithm according to an embodiment of the invention converges very well in practice. Unlike prior approximate inference methods that require the full joint distribution of variables to be specified, for example in the form of a Bayesian network or Markov network, a method according to an embodiment of the invention does not need to find the real dependencies among variables.

There is a special case when no additional outputs are known, i.e., O is empty. In this case, the advantage of using the conditional model (using information from extra outputs) is lost. Therefore, a method according to an embodiment of the invention would not be expected to outperform other generative models in this scenario.

There is also a trivial case when M−1 outputs are known. In this case, there is only one output to predict and there are no uncertainties from the other outputs. Thus, the generative model is not needed and the Bayesian integration produces the same result as a point estimation. An iterative algorithm according to an embodiment of the invention is also unnecessary.

C. Bayesian Integration

The question of how to perform the Bayesian integration in EQ. (2) analytically will now be addressed. From the conditional model, $P(y_m \mid x, y_{\bar m})$ has a Gaussian distribution of

${N( { y_{m} \middle| \mu_{y_{m}} ,s_{y_{m}}} )} = {\frac{1}{( {2\pi} )^{d/2}{s_{y_{m}}}^{1/2}}{{\exp ( {{- \frac{1}{2}}( {y_{m} - \mu_{y_{m}}} )^{T}( s_{y_{m}} )^{- 1}( {y_{m} - \mu_{y_{m}}} )} )}.}}$

Specifically,

$$\mu_{y_m} = \sum_{i=1}^{N_m} \beta_i\, k(z, z_{m,i} \mid \Lambda_z), \qquad (14)$$

$$s_{y_m} = \sigma_{y_m}^2 - \sum_{i=1}^{N_m}\sum_{j=1}^{N_m} \gamma_{ij}\, k(z, z_{m,i} \mid \Lambda_z)\, k(z, z_{m,j} \mid \Lambda_z), \qquad (15)$$

where z is the combined vector $[x^T\; y_O^T\; y_{U\bar m}^T]^T$ and

$k(z, z_{m,n} \mid \Lambda_z) = k(x, x_{m,n} \mid \Lambda_x)\, k(y_O, y_{O,n} \mid \Lambda_{y_O})\, k(y_{U\bar m}, y_{U\bar m,n} \mid \Lambda_{y_{U\bar m}}).$  (16)

$z_{m,i}$ (or $x_{m,i}$, $y_{O,i}$ and $y_{U\bar m,i}$) are all known training data. According to an embodiment of the invention, it is assumed that the training data are complete; the missing-data case is addressed at the end of this subsection. Hyperparameters $\Lambda_x$, $\Lambda_{y_O}$, $\Lambda_{y_{U\bar m}}$ contain squared length scales for the corresponding dimensions, and $\sigma_{y_m}^2$ is the noise variance. These are all learned and fixed after training. The results of EQS. (14) and (15) come from the standard Gaussian process predictive distribution. Both $\beta_i$ and $\gamma_{ij}$ are constant after training.

From the generative model, $P(y_{U\bar m} \mid x, y_O)$ is another Gaussian distribution (see EQ. (10)):

$P(y_{U\bar m} \mid x, y_O) = N(y_{U\bar m} \mid \mu_{y_{U\bar m}}, S_{y_{U\bar m}}) \propto k(y_{U\bar m}, \mu_{y_{U\bar m}} \mid S_{y_{U\bar m}}),$  (17)

where $S_{y_{U\bar m}}$ denotes the diagonal covariance matrix. Unfortunately, the integral in EQ. (2) is now analytically intractable, because $y_{U\bar m}$ is embedded in the kernel functions in EQS. (14) and (15).

According to an embodiment of the invention, EQ. (2) can be approximated by a Gaussian distribution:

$P(y_m \mid x, y_O) \approx q(y_m) = N(y_m \mid v_{y_m}, \xi_{y_m}).$  (18)

This breaks down to matching the mean and variance between $q(y_m)$ and $P(y_m \mid x, y_O)$ in EQ. (2). For calculating the mean $v_{y_m}$, after first integrating over $y_m$, one has

$v_{y_m} = \int_{y_{U\bar m}} \mu_{y_m}\, N(y_{U\bar m} \mid \mu_{y_{U\bar m}}, S_{y_{U\bar m}})\, dy_{U\bar m}.$  (19)

Using the fact that $\int_u k(x_1, u \mid \Lambda_1)\, k(u, x_2 \mid \Lambda_2)\, du \propto k(x_1, x_2 \mid \Lambda_1 + \Lambda_2)$, $v_{y_m}$ can be shown to have an analytical form:

$$v_{y_m} = \sum_{n=1}^{N_m} \beta_n \left|\Lambda_{y_{U\bar m}}^{-1} S_{y_{U\bar m}} + I\right|^{-1/2} k(x, x_{m,n} \mid \Lambda_x)\, k(y_O, y_{O,n} \mid \Lambda_{y_O})\, k\!\left(\mu_{y_{U\bar m}},\, y_{U\bar m,n} \,\middle|\, \Lambda_{y_{U\bar m}} + S_{y_{U\bar m}}\right) \qquad (20)$$

EQ. (20) implies that if the prediction of $y_{U\bar m}$ is accurate, with small variances on the diagonal of $S_{y_{U\bar m}}$, then $S_{y_{U\bar m}}$ becomes negligible compared to $\Lambda_{y_{U\bar m}}$, and the posterior mean $v_{y_m}$ is approximately the same as EQ. (14), with $y_{U\bar m}$ replaced by $\mu_{y_{U\bar m}}$. This is exactly the point estimation case. If an element of $y_{U\bar m}$ has a relatively large variance at the corresponding location of $S_{y_{U\bar m}}$, the corresponding dimension becomes less relevant in evaluating EQ. (20), and at the same time the whole value of EQ. (19) shrinks slightly towards 0, the prior mean of the Gaussian process.
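For illustration, EQ. (20) can be evaluated with a few lines of numpy once the kernel factors of EQ. (16) are available; in this sketch the diagonal matrices $\Lambda$ and $S_{y_{U\bar m}}$ are passed as vectors of squared length scales and variances, and all argument names are assumptions.

```python
import numpy as np

def bayesian_integration_mean(beta, x, X_train, y_O, Y_O_train,
                              mu_u, S_u_diag, Y_U_train,
                              ls_x, ls_yO, ls_yU):
    """EQ. (20): analytic posterior mean v_{y_m}. Each ls_* vector holds
    the squared length scales of the corresponding diagonal Lambda;
    S_u_diag holds the generative-model predictive variances."""
    def k(a, B, sq_scales):
        # Gaussian kernel between one vector a and the rows of B
        d2 = ((a[None, :] - B) ** 2 / sq_scales).sum(1)
        return np.exp(-0.5 * d2)
    # |Lambda^{-1} S + I|^{-1/2} reduces to a product over the diagonal
    det_term = np.prod(S_u_diag / ls_yU + 1.0) ** -0.5
    kx = k(x, X_train, ls_x)                     # k(x, x_{m,n} | Lambda_x)
    ko = k(y_O, Y_O_train, ls_yO)                # k(y_O, y_{O,n} | Lambda_{y_O})
    # the y_{U \bar m} factor uses the inflated scales Lambda + S
    ku = k(mu_u, Y_U_train, ls_yU + S_u_diag)
    return det_term * np.sum(beta * kx * ko * ku)
```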

Recall that if the training data contain missing entries, sampling is performed to augment the training data. For each of the T=10 sampled $Y_{\bar m}^{(t)}$, one obtains one predictive distribution $q(y_m)$ in EQ. (18). These ten results are then averaged to give the final predictive distribution.

Test Results

Tests were conducted on the Jura dataset, a robot arm dataset, and a truck dataset. For all datasets, each data dimension was normalized to zero mean and unit variance for training and testing. The standardized mean squared error (MSE), calculated in the normalized domain, was used to evaluate results. For the Jura data only, to be comparable with existing published results, the mean absolute error (MAE) was used, calculated after results are unnormalized into their original scales.
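For reference, both metrics are straightforward to compute; the following sketch (function names assumed, not from the patent) normalizes with the training statistics as described above.

```python
import numpy as np

def standardized_mse(y_true, y_pred, train_mean, train_std):
    """MSE computed in the normalized (zero-mean, unit-variance) domain,
    as used for the robot-arm and truck tests."""
    t = (y_true - train_mean) / train_std
    p = (y_pred - train_mean) / train_std
    return np.mean((t - p) ** 2)

def mae_original_scale(y_true_norm, y_pred_norm, train_mean, train_std):
    """MAE after un-normalizing back to the original scale, as reported
    for the Jura data."""
    t = y_true_norm * train_std + train_mean
    p = y_pred_norm * train_std + train_mean
    return np.mean(np.abs(t - p))
```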

The following methods are compared (GP stands for Gaussian process). The last four are methods according to embodiments of the invention.

Single-Task (ST) GP: This model learns each output independently as a single task, and is equivalent to a generative model according to an embodiment of the invention.

Multi-Task (MT) GP: This model extends a GP to multi-task learning by performing clustering on the tasks (outputs), with the logistic regression classifier replaced by a GP. In prediction, each GP works independently.

Bonilla GP: This is an implementation of the method disclosed in Bonilla, et al., "Multi-task Gaussian Process Prediction", Advances in Neural Information Processing Systems 20, MIT Press, 2008, the contents of which are herein incorporated by reference in their entirety. This method uses a single generative model to model the joint predictive distribution of all outputs. Specifically, all output training samples are modeled by a single Gaussian process. Unknown outputs are then predicted conditionally from the known outputs. A Cholesky decomposition is used for its $K^f$ matrix and no sparse approximation is considered.

One-pass CG GP: This is a one-pass version of a method according to an embodiment of the invention, using GPs in both the conditional and generative models with Bayesian integration.

One-pass Point CG GP: This is the same as the above method, except that only point estimation (using the delta function) is considered.

Iterative CG GP: This is the iterative version of a method according to an embodiment of the invention, with Bayesian integration.

Iterative Point CG GP: This is the same as the above method, except using point estimation.

Conditional model predictors and simple generative model predictors were trained independently. Training can be conducted in an EM-like algorithm. In the E-step, the posterior distribution $Q(y_m)$ is refined by an iterative algorithm according to an embodiment of the invention. In the M-step, $Q(y_m)$ is used to sample the missing data and learn the parameters of the predictors. This training scheme can also be viewed as a transductive test, as the models are trained and the missing entries are predicted simultaneously.

A. Jura data: The Jura dataset (www.ai-geostats.org) comprises measurements of concentrations of seven heavy metals (cadmium, cobalt, chromium, copper, nickel, lead and zinc) in the Swiss Jura Mountains. There are a total of 359 locations, divided into a fixed training set (259 locations) and a test set (100 locations). The following three tests were considered; each is to predict the concentration of a primary variable from the concentrations of some secondary variables and the x, y coordinates of a location. For test 1, the primary variable is cadmium, and the secondary variables are nickel and zinc. For test 2, the primary variable is copper, and the secondary variables are lead, nickel and zinc. For test 3, the primary variable is lead, and the secondary variables are copper, nickel and zinc. These variable settings are displayed in the table of FIG. 4.

A first set of the above three tests was run in an inductive manner, in which only the training set is available during training. During testing, given the test location, the concentration of the primary variable was predicted using the known secondary variables. For example, in test 1, there are d=2 inputs and M=3 outputs, and there is only one output, cadmium, to predict given a test location and the other two outputs, nickel and zinc.

This is an ideal test to answer the question raised above: when does it make sense to use outputs as extra inputs? Because the training data have no missing values and all secondary variables are known during the testing stage, a generative model according to an embodiment of the invention is not needed here. Under this setting, all methods produce the same result; thus, only the results from a One-pass CG GP according to an embodiment of the invention are reported. The table of FIG. 5 shows the MAE results for the different methods. In addition, the best scores using the Co-kriging method in an isotopic situation, disclosed in P. Goovaerts, "Ordinary cokriging revisited", Mathematical Geology, 30:21-42, 1998, the contents of which are herein incorporated by reference in their entirety, are reported.

A second set of the above three tests was run under a transductive setting (referred to as a heterotopic situation and a collocated situation in Goovaerts), in which both the training data and the secondary-variable test data are available during training. The task is to simultaneously perform training and predict the missing primary variable in the test data. Recall that algorithms according to embodiments of the invention are formulated under an inductive setting: models according to embodiments of the invention are fixed during testing. Methods according to embodiments of the invention are adapted to the transductive setting by removing values of the primary variable at test locations during training, thereby simulating the missing data cases.

The table of FIG. 6 shows the results. The table also reports the best mean scores from a full convolved GP disclosed in M. Alvarez and N. D. Lawrence, "Sparse convolved Gaussian processes for multi-output regression", Advances in Neural Information Processing Systems 21, MIT Press, 2009, the contents of which are herein incorporated by reference in their entirety, which does not conduct test 3 and does not consider the inductive setting. A One-pass CG GP method according to an embodiment of the invention performs the best in the inductive test and produces top scores comparable to those from Co-kriging in the transductive test. Besides the coordinates of a location, the real inputs that determine the metal concentration should include earth activity, pollution, and weather history. Such missing information is carried by the other outputs (concentrations of other metals) and can be recovered if those known outputs are used as extra inputs.

Most methods perform better under the transductive setting than under the inductive setting because of the extra information. The Bonilla GP, the convolved GP, and the Co-kriging are very competitive on this dataset, and all of them can be viewed as using a Gaussian process to model the joint predictive distribution. This seems to be a good generative model for the Jura dataset.

B. Robot arm data: A dataset generated from a realistic simulation of the dynamics of a Puma 560 robot arm (www.cs.toronto.edu/~delve/data/pumadyn/desc.html) was used. The code is based on the MATLAB code for Pumadyn-8 written by Zoubin Ghahramani. Specifically, four joints were considered, where each joint has an angle, an angular velocity and a torque, for a total of d=4×3=12 input dimensions. There are four outputs of the system, each indicating the angular acceleration of a joint. This thus forms an M=4 multi-task problem. The parameter γ, which controls unpredictability, was set to 0.3.

A total of 1000 data points were generated and split equally into a training set and a test set. The complete training data, without missing values, were used for training; therefore, no sampling is done in training. The number of known outputs $M_O$ was varied from 0 to 3. For example, when $M_O = 2$, the inputs and two randomly selected known outputs were used to predict each of the two unknown outputs. For each $M_O$ and each method, a total of ten random runs were performed.

FIG. 7 shows the standardized MSE results, with one standard deviation error bars, for the different methods in the Puma560 test, as a function of the number of known outputs $M_O$. In the figure, the plots are: (71) ST GP; (72) MT GP; (73) Bonilla GP; (74) One-pass CG GP; (75) One-pass Point CG GP; (76) Iterative CG GP; and (77) Iterative Point CG GP. The ST GP turns out to be a very good performer in this test. Both the MT GP and the Bonilla GP perform worse, even though they represent more advanced techniques. This implies that the angular accelerations at different joints behave so differently that modeling them jointly may be a rather challenging task. Note that the results from the ST GP and the MT GP do not change with $M_O$.

The ST GP also performs better than methods according to embodiments of the invention at $M_O = 0$. This is not unexpected, because in this case no outputs are known and the benefit of using the conditional model is lost. However, as $M_O$ increases, methods according to embodiments of the invention start to take advantage of the additional information and perform better.

Among the methods according to embodiments of the invention, those using Bayesian integration, such as the One-pass CG GP, perform better than their point estimation counterparts, such as the One-pass Point CG GP. The iterative method does not show a clear advantage over its one-pass counterpart. This may be attributed to the good performance of the ST GP, which is also used in a generative model according to an embodiment of the invention. With a good initialization from the ST GP, an iterative method according to an embodiment of the invention should converge very fast, such that the final result is close to that from the first iteration. At $M_O = M−1 = 3$, the results from the methods according to embodiments of the invention are identical, because no Bayesian integration and no iterations are performed.

C. Truck data: Monitoring the condition of expensive devices such as power plants and airplanes has received increasing interest in recent years. The goal is to detect failures of these machines at an early stage to avoid catastrophic damage. This is usually achieved by analyzing the values from a set of sensors installed in different parts of a machine. When correlations among the sensor values are violated, there is likely to be a failure. One important aspect of modeling such correlations is the ability to accurately predict process outputs based on process inputs.

However, sensors can fail, resulting in missing values in both training and test data. This naturally forms a conditional multi-output regression task that uses available sensor values to predict missing sensor values. In this test, according to an embodiment of the invention, the availability of all process inputs is assumed, as otherwise an extra generative model would be needed for x, as noted above.

28-hour operation data from a truck manufactured by a European automobile company were used. There are a total of 20 sensors installed in this truck. The following d=6 sensors were selected as inputs: ambient air temperature, ambient air pressure, altitude, fuel flow, throttle, and engine speed (rpm). The M=8 outputs include road speed, engine temperature, intake manifold temperature, turbo boost pressure, charge air cooler temperature and pressure, and exhaust temperature and pressure. During this operation time, the truck was running at altitudes between 100 meters and 900 meters and at speeds between 0 mph and 85 mph, providing a comprehensive snapshot of its performance. Note that the time when the engine was turned off was not considered.

There are a total of 825 data points with a resolution of about two minutes. 200 data points were randomly selected for training, with the remainder used for testing. For the 200 training data points, 5% of the output values were randomly removed to simulate the missing data cases. Note that 5% is typically much higher than the rate of a network failure or a sensor failure; this setting makes the tests more challenging than realistic applications. The number of known outputs $M_O$ was also varied from 0 to M−1=7. Such a test was repeated ten times.

FIGS. 8(a)-(b) illustrate the truck data test. FIG. 8(a) is a graph of the road speed (miles per hour) of the truck over the 825 data points of the 28-hour operation. FIG. 8(b) is a graph of the standardized MSE results (with one standard deviation error bars) for the different methods, with the number of known outputs $M_O$ varied from 0 to 7. In FIG. 8(b), the plots are: (81) ST GP; (82) MT GP; (83) Bonilla GP; (84) One-pass CG GP; (85) One-pass Point CG GP; (86) Iterative CG GP; and (87) Iterative Point CG GP.

FIG. 8(b) shows the results of the different algorithms. A One-pass CG GP according to an embodiment of the invention outperforms all previous methods. Iterative algorithms according to embodiments of the invention generally perform better than one-pass algorithms according to embodiments of the invention, especially when $M_O$ is large. However, when $M_O$ is small, such as 0 or 1, a One-pass CG GP according to an embodiment of the invention actually performs better than an Iterative CG GP according to an embodiment of the invention. This can be attributed to the fact that the available information is so limited that the iterative approach tends to become trapped in a local optimum.

A One-pass Point CG GP according to an embodiment of the invention performs worse than the other methods according to embodiments of the invention for almost all $M_O$ values. Its iterative version, an Iterative Point CG GP according to an embodiment of the invention, is similarly poor at small $M_O$ values. However, as $M_O$ increases, its performance quickly approaches that of an Iterative CG GP according to an embodiment of the invention (with Bayesian integration). This implies that with more outputs, and thus more information, available, the estimated posterior distribution $Q(y_m)$ becomes more sharply peaked and can be better approximated by the delta function used by the point estimation algorithm.

In general, the performance of all multi-output regression methods improves as the number of observed outputs increases, because more information is available. However, this is not always true. For example, from $M_O = 5$ to $M_O = 7$, the errors of a One-pass CG GP according to an embodiment of the invention and an Iterative CG GP according to an embodiment of the invention actually increase. This may be attributed to the fact that some outputs may be irrelevant for predicting a target output. The inclusion of such outputs, which becomes more likely with a larger $M_O$, may add noise that affects the performance of the regression algorithm.

Both the MT GP and the Bonilla GP, two multi-task approaches, now perform better than the ST GP. The reason may be that there are several correlated temperature and pressure sensors used as outputs: it makes more sense to share parameter representation using multi-task approaches, and it is also easier to model their joint predictive distribution.

In practice, the following strategy may be used to select among algorithms according to embodiments of the invention: when the number of known outputs $M_O$ is small, a One-pass CG GP algorithm according to an embodiment of the invention is used; for larger $M_O$ values, an Iterative CG GP algorithm according to an embodiment of the invention or an Iterative Point CG GP algorithm according to an embodiment of the invention is used.

System Implementations

It is to be understood that embodiments of the present invention can be implemented in various forms of hardware, software, firmware, special purpose processes, or a combination thereof. In one embodiment, the present invention can be implemented in software as an application program tangibly embodied on a computer readable program storage device. The application program can be uploaded to, and executed by, a machine comprising any suitable architecture.

FIG. 9 is a block diagram of an exemplary computer system for implementing a method for conditional multi-output regression according to an embodiment of the invention. Referring now to FIG. 9, a computer system 91 for implementing the present invention can comprise, inter alia, a central processing unit (CPU) 92, a memory 93 and an input/output (I/O) interface 94. The computer system 91 is generally coupled through the I/O interface 94 to a display 95 and various input devices 96 such as a mouse and a keyboard. The support circuits can include circuits such as cache, power supplies, clock circuits, and a communication bus. The memory 93 can include random access memory (RAM), read only memory (ROM), disk drive, tape drive, etc., or a combination thereof. The present invention can be implemented as a routine 97 that is stored in memory 93 and executed by the CPU 92 to process the signal from the signal source 98. As such, the computer system 91 is a general purpose computer system that becomes a specific purpose computer system when executing the routine 97 of the present invention.

The computer system 91 also includes an operating system and micro instruction code. The various processes and functions described herein can either be part of the micro instruction code or part of the application program (or a combination thereof) which is executed via the operating system. In addition, various other peripheral devices can be connected to the computer platform, such as an additional data storage device and a printing device.

It is to be further understood that, because some of the constituent system components and method steps depicted in the accompanying figures can be implemented in software, the actual connections between the system components (or the process steps) may differ depending upon the manner in which the present invention is programmed. Given the teachings of the present invention provided herein, one of ordinary skill in the related art will be able to contemplate these and similar implementations or configurations of the present invention.

While the present invention has been described in detail with reference to exemplary embodiments, those skilled in the art will appreciate that various modifications and substitutions can be made thereto without departing from the spirit and scope of the invention as set forth in the appended claims.

CLAIMS

1. A computer-implemented method of predicting sensor output values of a sensor monitoring system, comprising the steps of: providing a set of one or more test input values to a system of sensors, and one or more known sensor output values from said sensor system, wherein other sensor output values are unknown; calculating, for each unknown sensor output value, a predictive Gaussian distribution function
$$P(y_U \mid x) = \prod_{m \in U} P(y_m \mid x) = \frac{1}{(2\pi)^{d/2}\,|S_{y_U}|^{1/2}} \exp\left(-\tfrac{1}{2}(y_U - \mu_{y_U})^T S_{y_U}^{-1}(y_U - \mu_{y_U})\right)$$
from the test input values, wherein vector x of dimensionality d represents the test input values, vector $y_U$ of dimensionality M represents the set of unknown output sensor values, $y_m \in y_U$, $\mu_{y_U}$ is a vector of mean values of the set $y_U$ determined by a training phase, and $S_{y_U}$ is a diagonal covariance matrix of the set $y_U$ determined by said training phase; and predicting each unknown output $y_m$ from $P(y_m \mid x, y_O) = \int_{y_{U\bar m}} P(y_m \mid x, y_{U\bar m})\, P(y_U \mid x)\, dy_{U\bar m}$, wherein vector $y_O$ represents the known sensor output values, vector $y_{U\bar m}$ represents the unknown output sensor values in $y_U$ except $y_m$, and $P(y_m \mid x, y_{U\bar m})$ is a conditional Gaussian distribution defined by $\log P(y_m \mid x, y_{U\bar m}) = -\tfrac12 \log|K| - \tfrac12\, y_{U\bar m} K^{-1} y_{U\bar m}^T + C$, wherein $C = -(d/2)\log(2\pi)$ and K is an N×N kernel matrix defined between pairs of test input values $x_i$, $x_j$, wherein N is the number of test input values, whose elements $K_{i,j}$ are defined by Gaussian kernel functions $K_{i,j} = k(x_i, x_j \mid \Lambda) = \exp(-\tfrac12 (x_i - x_j)^T \Lambda^{-1}(x_i - x_j))$, wherein $\Lambda = \mathrm{diag}[\lambda_1^2, \ldots, \lambda_d^2]$ whose values $\lambda_i$ are determined by another training phase.

2. The method of claim 1, wherein the predictive Gaussian distribution function $P(y_m \mid x)$ for each unknown output $y_m$ is trained by maximizing $\log P(Y_m \mid X_m, \theta) = -\tfrac12 \log|K| - \tfrac12\, Y_m K^{-1} Y_m^T + C$ with respect to a hyperparameter $\theta = \{\lambda_1, \ldots, \lambda_d\}$, wherein $Y_m$ is a set of $N_m$ training samples for sensor output values corresponding to a set $X_m$ of $N_m$ training samples for sensor input values.

3. The method of claim 2, wherein $\log P(Y_m \mid X_m, \theta) = -\tfrac12 \log|K| - \tfrac12\, Y_m K^{-1} Y_m^T + C$ is maximized using a conjugate gradient method.

4. The method of claim 2, wherein the conditional Gaussian distribution $P(y_m \mid x, y_{U\bar m}, y_O)$ is trained by maximizing $\log P(Y_m \mid X, Y_{\bar m}, \theta) = -\tfrac12 \log|K| - \tfrac12\, Y_m K^{-1} Y_m^T + C$ with respect to a hyperparameter $\theta = \{\lambda_1, \ldots, \lambda_{d+M-1}\}$, for the training set of input values X and output values Y, wherein $Y_{\bar m}$ is an (M−1)×N matrix that represents training sets for all outputs except $y_m$, and wherein the Gaussian kernel functions are $K_{i,j} = k(z_i, z_j \mid \Lambda) = \exp(-\tfrac12 (z_i - z_j)^T \Lambda^{-1}(z_i - z_j))$, wherein z is a (d+M−1)-dimensional input vector that combines vectors $x \in X$ and $y_{\bar m} \in Y_{\bar m}$.

5. The method of claim 4, further comprising, when there are $N - N_m$ missing values in the set of output values $Y_m$, sampling a plurality of output values from the predictive Gaussian distribution $P(y_m \mid x)$ for each missing value in $Y_m$, and maximizing an average over the plurality of sampled values
$$\frac{1}{T}\sum_{t=1}^{T} \log P\!\left(Y_m^{(t)} \,\middle|\, X_m,\, Y_{\bar m}^{(t)},\, \theta\right)$$
with respect to the hyperparameter θ, wherein T is the number of the plurality of sampled values for each missing value in $Y_m$, and wherein each missing value in $Y_m$ is replaced by a sampled value.

6. The method of claim 1, further comprising repeatedly calculating
$$Q^{(l+1)}(y_m) = \int_{y_{U\bar m}} P(y_m \mid x, y_{\bar m}) \prod_{i \in U_{\bar m}} Q^{(l)}(y_i)\, dy_{U\bar m}$$
until convergence, wherein each $Q^{(1)}(y_m)$ is initialized as $Q^{(1)}(y_m) = P(y_m \mid x, y_O) = \int_{y_{U\bar m}} P(y_m \mid x, y_{U\bar m})\, P(y_U \mid x)\, dy_{U\bar m}$.

7. A computer-implemented method of predicting sensor output values of a sensor monitoring system, comprising the steps of: providing a set of one or more test input values to a system of sensors, and one or more known sensor output values from said sensor system, wherein other sensor output values are unknown; calculating, for each unknown sensor output value, a predictive distribution function
$$P(y_U \mid x) = \prod_{m \in U} P(y_m \mid x)$$
from the test input values, wherein $P(y_m \mid x) = \delta(y_m - \mu_{y_m})$, the Dirac delta function, vector x of dimensionality d represents the test input values, vector $y_U$ of dimensionality M represents the set of unknown output sensor values, $y_m \in y_U$, and $\mu_{y_U}$ is a vector of mean values of the set $y_U$ determined by a training phase; and predicting each unknown output $y_m \in y_U$ from $P(y_m \mid x, y_O) = P(y_m \mid x, \mu_{y_{U\bar m}}, y_O)$, wherein vector $y_O$ represents the plurality of known sensor output values, vector $y_{U\bar m}$ represents the unknown output sensor values in $y_U$ except $y_m$, and $P(y_m \mid x, \mu_{y_{U\bar m}}, y_O)$ is a conditional Gaussian distribution defined by $\log P(y_m \mid x, \mu_{y_{U\bar m}}, y_O) = -\tfrac12 \log|K| - \tfrac12\, \mu_{y_{U\bar m}} K^{-1} \mu_{y_{U\bar m}}^T + C$, wherein $C = -(d/2)\log(2\pi)$ is a constant and K is an N×N kernel matrix defined between pairs of test input values $x_i$, $x_j$, wherein N is the number of test input values, whose elements $K_{i,j}$ are defined as $K_{i,j} = k(x_i, x_j \mid \Lambda) = \exp(-\tfrac12 (x_i - x_j)^T \Lambda^{-1}(x_i - x_j))$, wherein $\Lambda = \mathrm{diag}[\lambda_1^2, \ldots, \lambda_d^2]$ whose values are determined by another training phase.

8. The method of claim 7, further comprising: (a) setting to zero a variance of the Gaussian distribution $P(y_m \mid x, y_O) = P(y_m \mid x, \mu_{y_{U\bar m}}, y_O)$, whereby $P(y_m \mid x, y_O)$ becomes a Dirac delta function; (b) calculating
$$Q^{(l+1)}(y_m) = \int_{y_{U\bar m}} P(y_m \mid x, y_{\bar m}) \prod_{i \in U_{\bar m}} Q^{(l)}(y_i)\, dy_{U\bar m},$$
wherein each $Q^{(1)}(y_m)$ is initialized as $Q^{(1)}(y_m) = P(y_m \mid x, y_O) = \delta(y_m - \mu_{y_m})$; and repeating steps (a) and (b) until convergence.

9. A program storage device readable by a computer, tangibly embodying a program of instructions executable by the computer to perform method steps for predicting sensor output values of a sensor monitoring system, the method comprising the steps of: providing a set of one or more test input values to a system of sensors, and one or more known sensor output values from said sensor system, wherein other sensor output values are unknown; calculating, for each unknown sensor output value, a predictive Gaussian distribution function
$$P(y_U \mid x) = \prod_{m \in U} P(y_m \mid x) = \frac{1}{(2\pi)^{d/2}\,|S_{y_U}|^{1/2}} \exp\left(-\tfrac{1}{2}(y_U - \mu_{y_U})^T S_{y_U}^{-1}(y_U - \mu_{y_U})\right)$$
from the test input values, wherein vector x of dimensionality d represents the test input values, vector $y_U$ of dimensionality M represents the set of unknown output sensor values, $y_m \in y_U$, $\mu_{y_U}$ is a vector of mean values of the set $y_U$ determined by a training phase, and $S_{y_U}$ is a diagonal covariance matrix of the set $y_U$ determined by said training phase; and predicting each unknown output $y_m$ from $P(y_m \mid x, y_O) = \int_{y_{U\bar m}} P(y_m \mid x, y_{U\bar m})\, P(y_U \mid x)\, dy_{U\bar m}$, wherein vector $y_O$ represents the known sensor output values, vector $y_{U\bar m}$ represents the unknown output sensor values in $y_U$ except $y_m$, and $P(y_m \mid x, y_{U\bar m})$ is a conditional Gaussian distribution defined by $\log P(y_m \mid x, y_{U\bar m}) = -\tfrac12 \log|K| - \tfrac12\, y_{U\bar m} K^{-1} y_{U\bar m}^T + C$, wherein $C = -(d/2)\log(2\pi)$ and K is an N×N kernel matrix defined between pairs of test input values $x_i$, $x_j$, wherein N is the number of test input values, whose elements $K_{i,j}$ are defined by Gaussian kernel functions $K_{i,j} = k(x_i, x_j \mid \Lambda) = \exp(-\tfrac12 (x_i - x_j)^T \Lambda^{-1}(x_i - x_j))$, wherein $\Lambda = \mathrm{diag}[\lambda_1^2, \ldots, \lambda_d^2]$ whose values $\lambda_i$ are determined by another training phase.

10. The computer readable program storage device of claim 9, wherein the predictive Gaussian distribution function $P(y_m \mid x)$ for each unknown output $y_m$ is trained by maximizing $\log P(Y_m \mid X_m, \theta) = -\tfrac12 \log|K| - \tfrac12\, Y_m K^{-1} Y_m^T + C$ with respect to a hyperparameter $\theta = \{\lambda_1, \ldots, \lambda_d\}$, wherein $Y_m$ is a set of $N_m$ training samples for sensor output values corresponding to a set $X_m$ of $N_m$ training samples for sensor input values.

11. The computer readable program storage device of claim 10, wherein $\log P(Y_m \mid X_m, \theta) = -\tfrac12 \log|K| - \tfrac12\, Y_m K^{-1} Y_m^T + C$ is maximized using a conjugate gradient method.

12. The computer readable program storage device of claim 10, wherein the conditional Gaussian distribution $P(y_m \mid x, y_{U\bar m}, y_O)$ is trained by maximizing $\log P(Y_m \mid X, Y_{\bar m}, \theta) = -\tfrac12 \log|K| - \tfrac12\, Y_m K^{-1} Y_m^T + C$ with respect to a hyperparameter $\theta = \{\lambda_1, \ldots, \lambda_{d+M-1}\}$, for the training set of input values X and output values Y, wherein $Y_{\bar m}$ is an (M−1)×N matrix that represents training sets for all outputs except $y_m$, and wherein the Gaussian kernel functions are $K_{i,j} = k(z_i, z_j \mid \Lambda) = \exp(-\tfrac12 (z_i - z_j)^T \Lambda^{-1}(z_i - z_j))$, wherein z is a (d+M−1)-dimensional input vector that combines vectors $x \in X$ and $y_{\bar m} \in Y_{\bar m}$.

13. The computer readable program storage device of claim 12, the method further comprising, when there are $N - N_m$ missing values in the set of output values $Y_m$, sampling a plurality of output values from the predictive Gaussian distribution $P(y_m \mid x)$ for each missing value in $Y_m$, and maximizing an average over the plurality of sampled values
$$\frac{1}{T}\sum_{t=1}^{T} \log P\!\left(Y_m^{(t)} \,\middle|\, X_m,\, Y_{\bar m}^{(t)},\, \theta\right)$$
with respect to the hyperparameter θ, wherein T is the number of the plurality of sampled values for each missing value in $Y_m$, and wherein each missing value in $Y_m$ is replaced by a sampled value.

14. The computer readable program storage device of claim 9, the method further comprising repeatedly calculating
$$Q^{(l+1)}(y_m) = \int_{y_{U\bar m}} P(y_m \mid x, y_{\bar m}) \prod_{i \in U_{\bar m}} Q^{(l)}(y_i)\, dy_{U\bar m}$$
until convergence, wherein each $Q^{(1)}(y_m)$ is initialized as $Q^{(1)}(y_m) = P(y_m \mid x, y_O) = \int_{y_{U\bar m}} P(y_m \mid x, y_{U\bar m})\, P(y_U \mid x)\, dy_{U\bar m}$.